null 2
A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation
Zhao, Heyang, He, Jiafan, Gu, Quanquan
The exploration-exploitation dilemma has been a central challenge in reinforcement learning (RL) with complex model classes. In this paper, we propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for RL with general function approximation. Our key algorithmic design includes (1) a general deterministic policy-switching strategy that achieves low switching cost, (2) a monotonic value function structure with carefully controlled function class complexity, and (3) a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost of $\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$ being the planning horizon, and $K$ being the number of episodes. Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
Anytime Model Selection in Linear Bandits
Kassraie, Parnian, Emmenegger, Nicolas, Krause, Andreas, Pacchiano, Aldo
Model selection in the context of bandit optimization is a challenging problem, as it requires balancing exploration and exploitation not only for action selection, but also for model selection. One natural approach is to rely on online learning algorithms that treat different models as experts. Existing methods, however, scale poorly ($\text{poly}M$) with the number of models $M$ in terms of their regret. Our key insight is that, for model selection in linear bandits, we can emulate full-information feedback to the online learner with a favorable bias-variance trade-off. This allows us to develop ALEXP, which has an exponentially improved ($\log M$) dependence on $M$ for its regret. ALEXP has anytime guarantees on its regret, and neither requires knowledge of the horizon $n$, nor relies on an initial purely exploratory stage. Our approach utilizes a novel time-uniform analysis of the Lasso, establishing a new connection between online learning and high-dimensional statistics.
- Research Report (0.63)
- Overview (0.46)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration
Liu, Zhihan, Lu, Miao, Xiong, Wei, Zhong, Han, Hu, Hao, Zhang, Shenao, Zheng, Sirui, Yang, Zhuoran, Wang, Zhaoran
In online reinforcement learning (online RL), balancing exploration and exploitation is crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration. However, in order to cope with general function approximators, most of them involve impractical algorithmic components to incentivize exploration, such as optimization within data-dependent level-sets or complicated sampling procedures. To address this challenge, we propose an easy-to-implement RL framework called \textit{Maximize to Explore} (\texttt{MEX}), which only needs to optimize \emph{unconstrainedly} a single objective that integrates the estimation and planning components while balancing exploration and exploitation automatically. Theoretically, we prove that \texttt{MEX} achieves a sublinear regret with general function approximations for Markov decision processes (MDP) and is further extendable to two-player zero-sum Markov games (MG). Meanwhile, we adapt deep RL baselines to design practical versions of \texttt{MEX}, in both model-free and model-based manners, which can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards. Compared with existing sample-efficient online RL algorithms with general function approximations, \texttt{MEX} achieves similar sample efficiency while enjoying a lower computational cost and is more compatible with modern deep RL methods.
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
Utilising the CLT Structure in Stochastic Gradient based Sampling : Improved Analysis and Faster Algorithms
Das, Aniket, Nagaraj, Dheeraj, Raj, Anant
We consider stochastic approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamcs (IPD). We observe that the noise introduced by the stochastic approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian. We harness this structure to absorb the stochastic approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms. For SGLD, we prove the first stable convergence rate in KL divergence without requiring uniform warm start, assuming the target density satisfies a Log-Sobolev Inequality. Our result implies superior first-order oracle complexity compared to prior works, under significantly milder assumptions. We also prove the first guarantees for SGLD under even weaker conditions such as H\"{o}lder smoothness and Poincare Inequality, thus bridging the gap between the state-of-the-art guarantees for LMC and SGLD. Our analysis motivates a new algorithm called covariance correction, which corrects for the additional noise introduced by the stochastic approximation by rescaling the strength of the diffusion. Finally, we apply our techniques to analyze RBM, and significantly improve upon the guarantees in prior works (such as removing exponential dependence on horizon), under minimal assumptions.
- Asia > India (0.14)
- North America > United States > Illinois (0.14)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.85)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.67)
Inverse Extended Kalman Filter -- Part II: Highly Non-Linear and Uncertain Systems
Singh, Himali, Chattopadhyay, Arpan, Mishra, Kumar Vijay
For example, inverse Kalman filter (I-KF) has been recently formulated to estimate the adversary's Kalman-filter-tracked estimates and hence, predict the adversary's future steps. The purpose of this paper and the companion paper (Part I) is to address the inverse filtering problem in non-linear systems by proposing an inverse extended Kalman filter (I-EKF). The companion paper proposed the theory of I-EKF (with and without unknown inputs) and I-KF (with unknown inputs). In this paper, we develop this theory for highly non-linear models, which employ second-order, Gaussian sum, and dithered forward EKFs. In particular, we derive theoretical stability guarantees for the inverse second-order EKF using the bounded non-linearity approach. T o address the limitation of the standard I-EKFs that the system model and forward filter are perfectly known to the defender, we propose reproducing kernel Hilbert space-based EKF to learn the unknown system dynamics based on its observations, which can be employed as an inverse filter to infer the adversary's estimate. Numerical experiments demonstrate the state estimation performance of the proposed filters using recursive Cram er-Rao lower bound as a benchmark. Index T erms--Bayesian filtering, counter-adversarial systems, extended Kalman filter, inverse filtering, non-linear processes.
- North America > United States (0.67)
- Asia > India (0.14)
- Information Technology > Sensing and Signal Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Wasserstein Mirror Gradient Flow as the limit of the Sinkhorn Algorithm
Deb, Nabarun, Kim, Young-Heon, Pal, Soumik, Schiebinger, Geoffrey
We prove that the sequence of marginals obtained from the iterations of the Sinkhorn algorithm or the iterative proportional fitting procedure (IPFP) on joint densities, converges to an absolutely continuous curve on the $2$-Wasserstein space, as the regularization parameter $\varepsilon$ goes to zero and the number of iterations is scaled as $1/\varepsilon$ (and other technical assumptions). This limit, which we call the Sinkhorn flow, is an example of a Wasserstein mirror gradient flow, a concept we introduce here inspired by the well-known Euclidean mirror gradient flows. In the case of Sinkhorn, the gradient is that of the relative entropy functional with respect to one of the marginals and the mirror is half of the squared Wasserstein distance functional from the other marginal. Interestingly, the norm of the velocity field of this flow can be interpreted as the metric derivative with respect to the linearized optimal transport (LOT) distance. An equivalent description of this flow is provided by the parabolic Monge-Amp\`{e}re PDE whose connection to the Sinkhorn algorithm was noticed by Berman (2020). We derive conditions for exponential convergence for this limiting flow. We also construct a Mckean-Vlasov diffusion whose marginal distributions follow the Sinkhorn flow.
- Europe > United Kingdom (0.14)
- North America > Canada > British Columbia (0.14)
- North America > United States > Washington > King County > Seattle (0.14)
Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection
Bai, Yu, Chen, Fan, Wang, Huan, Xiong, Caiming, Mei, Song
Neural sequence models based on the transformer architecture have demonstrated remarkable \emph{in-context learning} (ICL) abilities, where they can perform new tasks when prompted with training and test examples, without any parameter update to the model. This work first provides a comprehensive statistical theory for transformers to perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Using an efficient implementation of in-context gradient descent as the underlying mechanism, our transformer constructions admit mild size bounds, and can be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, intriguingly, we show that transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life -- A \emph{single} transformer can adaptively select different base ICL algorithms -- or even perform qualitatively different tasks -- on different input sequences, without any explicit prompting of the right algorithm or task. We both establish this in theory by explicit constructions, and also observe this phenomenon experimentally. In theory, we construct two general mechanisms for algorithm selection with concrete examples: pre-ICL testing, and post-ICL validation. As an example, we use the post-ICL validation mechanism to construct a transformer that can perform nearly Bayes-optimal ICL on a challenging task -- noisy linear models with mixed noise levels. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.
- Materials > Chemicals > Industrial Gases > Liquified Gas (0.45)
- Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (0.45)
- Energy > Oil & Gas > Midstream (0.45)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
On Quantum Speedups for Nonconvex Optimization via Quantum Tunneling Walks
Liu, Yizhou, Su, Weijie J., Li, Tongyang
Classical algorithms are often not effective for solving nonconvex optimization problems where local minima are separated by high barriers. In this paper, we explore possible quantum speedups for nonconvex optimization by leveraging the global effect of quantum tunneling. Specifically, we introduce a quantum algorithm termed the quantum tunneling walk (QTW) and apply it to nonconvex problems where local minima are approximately global minima. We show that QTW achieves quantum speedup over classical stochastic gradient descents (SGD) when the barriers between different local minima are high but thin and the minima are flat. Based on this observation, we construct a specific double-well landscape, where classical algorithms cannot efficiently hit one target well knowing the other well but QTW can when given proper initial states near the known well. Finally, we corroborate our findings with numerical experiments.
- North America > United States (0.45)
- Asia > China (0.14)
Variable-Complexity Weighted-Tempered Gibbs Samplers for Bayesian Variable Selection
Subset weighted-Tempered Gibbs Sampler (wTGS) has been recently introduced by Jankowiak to reduce the computation complexity per MCMC iteration in high-dimensional applications where the exact calculation of the posterior inclusion probabilities (PIP) is not essential. However, the Rao-Backwellized estimator associated with this sampler has a high variance as the ratio between the signal dimension and the number of conditional PIP estimations is large. In this paper, we design a new subset weighted-Tempered Gibbs Sampler (wTGS) where the expected number of computations of conditional PIPs per MCMC iteration can be much smaller than the signal dimension. Different from the subset wTGS and wTGS, our sampler has a variable complexity per MCMC iteration. We provide an upper bound on the variance of an associated Rao-Blackwellized estimator for this sampler at a finite number of iterations, $T$, and show that the variance is $O\big(\big(\frac{P}{S}\big)^2 \frac{\log T}{T}\big)$ for a given dataset where $S$ is the expected number of conditional PIP computations per MCMC iteration. Experiments show that our Rao-Blackwellized estimator can have a smaller variance than its counterpart associated with the subset wTGS.
A Non-Asymptotic Framework for Approximate Message Passing in Spiked Models
Approximate message passing (AMP) emerges as an effective iterative paradigm for solving high-dimensional statistical problems. However, prior AMP theory -- which focused mostly on high-dimensional asymptotics -- fell short of predicting the AMP dynamics when the number of iterations surpasses $o\big(\frac{\log n}{\log\log n}\big)$ (with $n$ the problem dimension). To address this inadequacy, this paper develops a non-asymptotic framework for understanding AMP in spiked matrix estimation. Built upon new decomposition of AMP updates and controllable residual terms, we lay out an analysis recipe to characterize the finite-sample behavior of AMP in the presence of an independent initialization, which is further generalized to allow for spectral initialization. As two concrete consequences of the proposed analysis recipe: (i) when solving $\mathbb{Z}_2$ synchronization, we predict the behavior of spectrally initialized AMP for up to $O\big(\frac{n}{\mathrm{poly}\log n}\big)$ iterations, showing that the algorithm succeeds without the need of a subsequent refinement stage (as conjectured recently by \citet{celentano2021local}); (ii) we characterize the non-asymptotic behavior of AMP in sparse PCA (in the spiked Wigner model) for a broad range of signal-to-noise ratio.